Introduction

This document outlines a comprehensive analysis of single-cell RNA sequencing (scRNA-seq) data obtained from muscle biopsy-derived myocyte cultures. The study includes samples from FSHD patients (with FSHD1 and FSHD2 subtypes) and healthy controls. Our analysis includes data pre-processing, normalization, batch effect correction, clustering, differential expression analysis, pathway enrichment, and cell type annotation.

1. Setting Up the Environment

1.1 Clear the Environment and Set Options

We start by clearing the R environment to avoid any conflicts with existing objects and then set options to avoid truncated outputs and scientific notation.

# Clear all objects including hidden ones
rm(list = ls(all.names = TRUE))
# Free up memory
gc()
##           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells  631712 33.8    1454771 77.7         NA   700240 37.4
## Vcells 1171182  9.0    8388608 64.0      32768  1963325 15.0
# Set options to avoid truncated output and scientific notation
options(max.print = .Machine$integer.max, scipen = 999, stringsAsFactors = FALSE, dplyr.summarise.inform = FALSE, future.globals.maxSize = 12 * 1024^3) 

1.2 Load Required Libraries

Load the essential libraries for scRNA-seq analysis, data manipulation, visualization, normalization, batch effect correction, and cell type annotation.

library(Seurat)         # For single-cell analysis
## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.4.0 but the current version is
## 4.4.2; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 'SeuratObject' was built with package 'Matrix' 1.7.0 but the current
## version is 1.7.2; it is recomended that you reinstall 'SeuratObject' as
## the ABI for 'Matrix' may have changed
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t
library(tidyverse)      # For data manipulation and plotting
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Matrix)         # For handling sparse matrices
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(SingleR)        # For automated cell type annotation
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## 
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## 
## Attaching package: 'MatrixGenerics'
## 
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## 
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## 
## The following object is masked from 'package:SeuratObject':
## 
##     intersect
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## 
## The following objects are masked from 'package:Matrix':
## 
##     expand, unname
## 
## The following objects are masked from 'package:lubridate':
## 
##     second, second<-
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## 
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## The following object is masked from 'package:utils':
## 
##     findMatches
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## 
## The following object is masked from 'package:lubridate':
## 
##     %within%
## 
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## 
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## The following object is masked from 'package:sp':
## 
##     %over%
## 
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## Attaching package: 'Biobase'
## 
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## 
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## 
## 
## Attaching package: 'SummarizedExperiment'
## 
## The following object is masked from 'package:Seurat':
## 
##     Assays
## 
## The following object is masked from 'package:SeuratObject':
## 
##     Assays
library(celldex)        # Provides reference datasets for SingleR
## 
## Attaching package: 'celldex'
## 
## The following objects are masked from 'package:SingleR':
## 
##     BlueprintEncodeData, DatabaseImmuneCellExpressionData,
##     HumanPrimaryCellAtlasData, ImmGenData, MonacoImmuneData,
##     MouseRNAseqData, NovershternHematopoieticData
library(clusterProfiler) # For gene set enrichment analysis
## 
## clusterProfiler v4.14.4 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
## 
## Please cite:
## 
## S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang,
## W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize
## multiomics data. Nature Protocols. 2024, 19(11):3292-3320
## 
## Attaching package: 'clusterProfiler'
## 
## The following object is masked from 'package:IRanges':
## 
##     slice
## 
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## 
## The following object is masked from 'package:purrr':
## 
##     simplify
## 
## The following object is masked from 'package:stats':
## 
##     filter
library(org.Hs.eg.db)   # Annotation package for human genes
## Loading required package: AnnotationDbi
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:clusterProfiler':
## 
##     select
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(sctransform)    # For advanced normalization methods
library(harmony)        # For batch effect correction
## Loading required package: Rcpp
library(viridis)        # For colorblind-friendly palettes
## Loading required package: viridisLite
library(patchwork)      # For combining plots
library(biomaRt)        # For gene annotation conversion

2. Data Loading and Initial Processing

2.1 Annotate Genes

We create a function annotate_genes to convert Ensembl gene IDs to HGNC gene symbols using the biomaRt package. This ensures that downstream analyses use recognizable gene symbols.

annotate_genes <- function(counts) {
  library(biomaRt)
  
  # Connect to the Ensembl database
  ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  
  # Extract Ensembl IDs from row names of the count matrix
  ensembl_ids <- rownames(counts)
  
  # Retrieve the corresponding HGNC symbols
  gene_map <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                    filters = "ensembl_gene_id",
                    values = ensembl_ids,
                    mart = ensembl)
  
  # Replace empty gene symbols with NA
  gene_map$hgnc_symbol[gene_map$hgnc_symbol == ""] <- NA
  
  # Remove duplicate Ensembl IDs for a unique mapping
  gene_map <- gene_map[!duplicated(gene_map$ensembl_gene_id), ]
  
  # Handle duplicate gene symbols by appending Ensembl ID if needed
  gene_map$hgnc_symbol <- make.unique(ifelse(is.na(gene_map$hgnc_symbol), 
                                             gene_map$ensembl_gene_id, 
                                             gene_map$hgnc_symbol))
  
  # Create a named vector for mapping Ensembl IDs to gene symbols
  gene_map_named <- setNames(gene_map$hgnc_symbol, gene_map$ensembl_gene_id)
  
  # Replace row names with gene symbols where available
  rownames(counts) <- ifelse(rownames(counts) %in% names(gene_map_named),
                             gene_map_named[rownames(counts)],
                             rownames(counts))
  
  # Ensure unique row names to avoid duplication issues
  rownames(counts) <- make.unique(rownames(counts))
  
  return(counts)
}

2.2 Create Seurat Objects

The function create_seurat_object reads the count data from a file, annotates the genes, and then creates a Seurat object with initial filtering. Metadata for sample and condition is also added.

create_seurat_object <- function(file_path, sample_name) {
  # Read the count matrix data from a text file (.txt.gz)
  counts <- read.table(file_path, header = TRUE, sep = "\t", row.names = 1, check.names = FALSE)
  
  # Annotate gene names using the custom function
  counts <- annotate_genes(counts)
  
  # Create a Seurat object with filters:
  # - Genes must be detected in at least 3 cells
  # - Cells must have at least 200 detected genes
  seurat_obj <- CreateSeuratObject(
    counts = counts,
    project = sample_name,
    min.cells = 3,
    min.features = 200
  )
  
  # Add metadata: sample name and condition (FSHD1, FSHD2, or Control)
  seurat_obj$sample <- sample_name
  seurat_obj$condition <- ifelse(grepl("FSHD", sample_name), "FSHD", "Control")
  return(seurat_obj)
}

2.3 Merge Samples

Read count data from multiple files, create Seurat objects for each sample, and merge them into one combined Seurat object for the complete analysis.

# List all sample files from the directory (adjust the path as needed)
sample_files <- list.files("GSE122873_RAW", full.names = TRUE, pattern = "*.txt.gz")

# Define sample names corresponding to the files
sample_names <- c("FSHD1_1", "FSHD1_2", "FSHD2_1", "FSHD2_2", "Control_1", "Control_2")

# Create a list of Seurat objects for each sample using mapply
seurat_objects <- mapply(create_seurat_object, 
                         sample_files, 
                         sample_names,
                         SIMPLIFY = FALSE)
## Warning: Data is of class data.frame. Coercing to dgCMatrix.
## Warning: Data is of class data.frame. Coercing to dgCMatrix.
## Warning: Data is of class data.frame. Coercing to dgCMatrix.
## Warning: Data is of class data.frame. Coercing to dgCMatrix.
## Warning: Data is of class data.frame. Coercing to dgCMatrix.
## Warning: Data is of class data.frame. Coercing to dgCMatrix.
# Merge the individual Seurat objects into a single Seurat object
merged_seurat <- merge(seurat_objects[[1]], 
                       y = seurat_objects[2:length(seurat_objects)])
## Warning: Some cell names are duplicated across objects provided. Renaming to
## enforce unique cell names.

3. Quality Control and Filtering

3.1 Calculate Mitochondrial Percentage and Filter Cells

We calculate the mitochondrial gene percentage for each cell and remove low-quality cells based on gene detection count and mitochondrial content.

# Calculate the percentage of mitochondrial gene expression
merged_seurat$percent.mt <- PercentageFeatureSet(merged_seurat, pattern = "^MT-")

# Filter cells based on:
# - More than 200 features (genes) and fewer than 6000 features
# - Less than 15% mitochondrial gene expression
merged_seurat <- subset(merged_seurat, subset = nFeature_RNA > 200 & nFeature_RNA < 6000 & percent.mt < 15)

3.2 Normalize Data and Perform PCA/UMAP

We normalize the data using SCTransform, perform principal component analysis (PCA) to reduce dimensionality, and then generate a UMAP for visualization.

# Normalize data and regress out the mitochondrial percentage using SCTransform
merged_seurat <- SCTransform(merged_seurat, vars.to.regress = "percent.mt", verbose = FALSE)

# Perform PCA on the normalized data
merged_seurat <- RunPCA(merged_seurat, assay = "SCT", verbose = FALSE)

# Generate a UMAP embedding for visualization of the data structure
merged_seurat <- RunUMAP(merged_seurat, dims = 1:30, verbose = FALSE)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

4. Batch Effect Correction

Correct for batch effects (e.g., differences across samples) using Harmony, then update UMAP visualization.

# Correct batch effects based on the 'sample' metadata column using Harmony
merged_seurat <- RunHarmony(merged_seurat, group.by.vars = "sample", verbose = FALSE)

# Update the UMAP embedding after Harmony correction
merged_seurat <- RunUMAP(merged_seurat, dims = 1:30, verbose = FALSE)

# Visualize the data colored by sample to check for batch correction
DimPlot(merged_seurat, reduction = "umap", group.by = "sample")

5. Clustering and Cell Type Identification

5.1 Clustering on Batch-Corrected Data

Perform clustering on the Harmony-corrected data. This involves re-running UMAP using the Harmony reduction, identifying cell neighbors, and clustering cells.

# Generate UMAP using Harmony reduction
merged_seurat <- RunUMAP(merged_seurat, reduction = "harmony", dims = 1:20, verbose = FALSE)

# Identify cell neighbors and perform clustering
merged_seurat <- FindNeighbors(merged_seurat, reduction = "harmony", dims = 1:20)
## Computing nearest neighbor graph
## Computing SNN
merged_seurat <- FindClusters(merged_seurat, resolution = 0.5, verbose = FALSE)

# Visualize clusters and sample distribution on UMAP
DimPlot(merged_seurat, reduction = "umap", group.by = "sample")

DimPlot(merged_seurat, reduction = "umap", group.by = "seurat_clusters")

# Define marker genes
myogenic_markers <- c("MYF5", "MYOD1", "MYOG", "MYH3")  # Myogenesis markers
fibroblast_markers <- c("ANPEP", "COL1A2", "VIM")  # Fibroblast markers

# Generate UMAP plots
p1 <- FeaturePlot(merged_seurat, features = myogenic_markers, cols = c("lightgray", "blue"), min.cutoff = "q10", max.cutoff = "q90", reduction = "umap")

p1

p2 <- FeaturePlot(merged_seurat, features = fibroblast_markers, cols = c("lightgray", "red"), min.cutoff = "q10", max.cutoff = "q90", reduction = "umap")

p2

5.2 Differential Expression Analysis

Before comparing FSHD and Control samples, set the active cell identity to the condition metadata column. Then, perform differential expression analysis between the "FSHD1" and "Control" groups.

# Set the active identity to the 'condition' column for differential expression
Idents(merged_seurat) <- merged_seurat$condition

# Prepare the object for differential expression analysis using SCT assay
merged_seurat <- PrepSCTFindMarkers(merged_seurat, assay = "SCT", verbose = FALSE)

# Now perform differential expression analysis comparing FSHD2 versus Control using the SCT assay
de_genes <- FindMarkers(merged_seurat, ident.1 = "FSHD", ident.2 = "Control", assay = "SCT")
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
# Display the top differentially expressed genes
head(de_genes)
##        p_val avg_log2FC pct.1 pct.2 p_val_adj
## RPL36A     0   1.387597 0.957 0.688         0
## RPS21      0   1.788058 0.959 0.698         0
## RPS29      0   1.460634 0.936 0.711         0
## RPL38      0   1.438665 0.972 0.818         0
## RPL23      0   1.272439 0.992 0.867         0
## RPS10      0   1.130713 0.988 0.894         0
# Extract top differentially expressed genes (e.g., top 5 genes based on log2FC)
top_de_genes <- rownames(de_genes[order(de_genes$avg_log2FC, decreasing = TRUE), ])[1:5]

# Generate UMAP feature plots for the top DE genes
p_de <- FeaturePlot(merged_seurat, features = top_de_genes, cols = c("lightgray", "purple"), 
                    min.cutoff = "q10", max.cutoff = "q90", reduction = "umap")

# Display the plots
p_de

5.3 Pathway Enrichment Analysis

Using the differentially expressed genes, we create a ranked gene list and perform gene set enrichment analysis (GSEA) for Gene Ontology Biological Processes (GO BP).

# Create a ranked list of genes based on average log2 fold-change
gene_list <- de_genes$avg_log2FC
names(gene_list) <- rownames(de_genes)
gene_list <- sort(gene_list, decreasing = TRUE)

# Perform GSEA on GO Biological Process terms
go_enrichment <- gseGO(geneList = gene_list,
                       ont = "BP",
                       keyType = "SYMBOL",
                       minGSSize = 10,
                       maxGSSize = 500,
                       pvalueCutoff = 0.05,
                       verbose = FALSE,
                       OrgDb = org.Hs.eg.db,
                       pAdjustMethod = "BH")
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (18.83% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 9 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 0.0000000001.
## You can set the `eps` argument to zero for better estimation.
# Visualize the top enriched pathways using a dot plot
dotplot(go_enrichment, showCategory = 20)

5.4 Cell Type Annotation with SingleR

Annotate cell types using the SingleR package and the Human Primary Cell Atlas as a reference.

# Load reference dataset for cell type annotation
ref <- HumanPrimaryCellAtlasData()

# Annotate cell types using SingleR on RNA assay data
predictions <- SingleR(test = GetAssayData(merged_seurat, assay = "RNA"),
                       ref = ref,
                       labels = ref$label.main)
## Warning in GetAssayData.StdAssay(object = object[[assay]], layer = layer): data
## layer is not found and counts layer is used
# Add the predicted cell type labels to the Seurat object metadata
merged_seurat$SingleR.labels <- predictions$labels

# Visualize the annotated cell types on UMAP
DimPlot(merged_seurat, reduction = "umap", group.by = "SingleR.labels")

Conclusion

This analysis provides insights into the transcriptomic differences between FSHD patient samples and healthy controls. Through data normalization, batch effect correction, clustering, differential expression, pathway enrichment analysis, and cell type annotation, we have highlighted key molecular pathways and cell populations. Future work could include integrating multi-omics data to further elucidate the mechanisms underlying FSHD.